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Abstract 

A  heuristic  that  has  emerged  in  the  area  of  importance  sampling 
is  that  the  changes  of  measure  used  to  prove  large  deviation  lower 
bounds  give  good  performance  when  used  for  importance  sampling. 
Recent  work,  however,  has  suggested  that  the  heuristic  is  incorrect  in 
many  situations.  The  perspective  put  forth  in  the  present  paper  is  that 
large  deviation  theory  suggests  many  changes  of  measure,  and  that  not 
all  are  suitable  for  importance  sampling.  In  the  setting  of  Cramer’s 
Theorem,  the  traditional  interpretation  of  the  heuristic  suggests  a  fixed 
change  of  distribution  on  the  underlying  independent  and  identically 
distributed  summands.  In  contrast,  we  consider  importance  sampling 
schemes  where  the  exponential  change  of  measure  is  adaptive,  in  the 
sense  that  it  depends  on  the  historical  empirical  mean.  The  existence 
of  asymptotically  optimal  schemes  within  this  class  is  demonstrated. 

The  result  indicates  that  an  adaptive  change  of  measure,  rather  than 
a  static  change  of  measure,  is  what  the  large  deviations  analysis  truly 
suggests.  The  proofs  utilize  a  control-theoretic  approach  to  large  de¬ 
viations,  which  naturally  leads  to  the  construction  of  asymptotically 
optimal  adaptive  schemes  in  terms  of  a  limit  Bellman  equation.  Nu¬ 
merical  examples  contrasting  the  adaptive  and  standard  schemes  are 
presented,  as  well  as  an  interpretation  of  their  different  performances 
in  terms  of  differential  games. 
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1  Introduction  and  Background 


A  basic  technique  for  the  approximation  of  probabilities  and  functionals 
of  probability  measures  is  Monte  Carlo  simulation.  Let  A  be  a  random 
variable  taking  values  in  the  real  numbers.  To  estimate  EX,  a  sequence 
of  independent  and  identically  distributed  (iid)  copies  Ao,  Ai,  ...  of  A  are 
generated,  and  the  estimate  for  EX  based  on  the  first  K  samples  is  just  the 
sample  mean:  Qk  =  (Ao  +  X\  +  •  •  •  +  Xk~i)/K.  Since  the  convergence 
of  the  estimate  is  based  on  the  law  of  large  numbers,  a  standard  rate  of 
convergence  can  be  defined  in  terms  of  the  variance  var[Ao].  Indeed,  the 
variance  of  Qk  (assuming  that  var[Ao]  exists)  is  just  var[Ao]/A. 

In  cases  where  the  variance  is  large,  and  especially  if  it  is  large  compared 
to  EX,  it  may  take  a  large  number  of  samples  before  the  variance  of  the 
estimator  is  acceptably  small  relative  to  EX.  This  is  a  common  occurrence 
when  estimating  rare  events,  and  also  when  estimating  functionals  whose 
behavior  is  largely  determined  by  rare  events.  In  such  cases,  one  may  be 
tempted  to  use  some  form  of  importance  sampling  to  reduce  the  variance, 
and  hence  speed  up  the  computation  by  requiring  fewer  samples.  Since 
importance  sampling  is  most  effective  when  dealing  with  rare  events,  it  is 
perhaps  not  surprising  the  literature  on  importance  sampling  and  its  relation 
to  large  deviations  is  extensive. 

The  basic  formulae  of  importance  sampling  are  as  follows.  Suppose  that 
A  has  distribution  9,  and  consider  an  alternative  sampling  distribution  r. 
It  is  required  that  9  be  absolutely  continuous  with  respect  to  r,  so  that  the 
Radon-Nikodym  derivative  f{x)  =  {dO / dr)(x)  exists.  lid  samples  Ao,  Ai, . . . 
with  distribution  r  are  generated,  and  the  estimate 

1  k-i  _ 

Qk  =  TV  X!  Xkf(Xk) 

A  fc=0 

is  considered  in  lieu  of  Qk-  Since 

EXkf( Xk)  =  [  xf{x)r(dx )  =  [  x9(dx)  =  EX, 

Jr  Jr 

Qk  is  an  unbiased  estimate  of  EX,  with  a  rate  of  convergence  determined 
by 

var  [A0/(Aq)]  =  [  x2f(x)9(dx)  -  [  x6{dx)  . 

JM. 

The  optimization  of  this  quantity  over  all  possible  r  is  inappropriate.  For 
example,  suppose  9  is  supported  on  [0,  oo),  and  9(dx)  =  g(x)dx.  Let 
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m  =  EX  =  /0°°  xQ(dx)  and  r(dx)  =  m  1xg( x)  dx.  Then  6  is  absolutely 
continuous  with  respect  to  r,  with  f{x)  =  m/x.  Furthermore, 

var  [X0f(X0)]  =  f  x2f(x)0(dx)  -  m2  =  0. 

Jr 

However,  such  a  distribution  r  is  of  little  use  in  practice  since  it  requires 
knowledge  of  rri,  the  very  thing  we  want  to  estimate!  Instead  of  this  uncon¬ 
strained  optimization,  one  typically  seeks  to  minimize  over  parameterized 
families  of  alternative  sampling  distributions. 

When  the  distribution  of  Xq  is  connected  to  a  large  deviations  problem, 
certain  sampling  distributions  are  immediately  suggested  by  the  form  of  the 
large  deviations  variational  problem.  In  order  to  explain  this  connection  we 
specialize  to  the  setting  of  Cramer’s  Theorem,  which  will  be  used  throughout 
the  rest  of  the  paper.  However,  the  conclusions  we  will  draw  on  the  relations 
between  importance  sampling,  large  deviations  and  differential  games  hold 
in  much  greater  generality,  and  some  of  these  generalizations  will  be  pursued 
elsewhere. 

Let  Yo ,  Y\ , . . .  be  a  sequence  of  iid  Rd- valued  random  variables  with  dis¬ 
tribution  /i.  and  assume  the  moment  generating  function  JRd  exp(a,  y)y(dy) 
is  finite  for  all  a  G  Rd.  Let  Sn  =  Yq  +  ■  ■  ■  +  Yn_ For  Borel  sets  A  C  Rd, 
Cramer’s  Theorem  is  concerned  with  the  large  deviation  approximation  of 
pn  =  P{Sn/n  G  A}.  Let  1  a{v)  denote  the  function  equal  to  1  if  y  G  A 
and  zero  otherwise.  Suppose  that  for  some  fixed  value  n  we  consider  Monte 
Carlo  approximation  for  P  {Sn/n  G  A}.  In  terms  of  the  notation  introduced 
previously,  we  have  X  =  1  A{Sn/n),  and  straightforward  Monte  Carlo  would 
require  the  generation  of  many  independent  copies  of  X  (i.e. ,  of  Sn),  say 
(Xq,  Xi,  ,  Xk~ i).  The  rate  of  convergence  of  the  naive  estimator  Qk  is 
determined  by 

var  [Qk\  =  ( Pn~Pn)lK 2- 

If  pn  — >  0  as  n  — ►  oo,  this  variance  approaches  zero.  However,  the  relative 
error  is 


standard  deviation  of  Qk  \Jpn  —  Pn 

relative  error  =  - — — -  =  — - . 

mean  of  Qk  Kpn 

Since  \Jpn  —  Pn/Pn  oo,  a  large  sample  size  (i.e.,  I\)  is  required  for  the 
estimator  to  achieve  a  reasonable  relative  error  bound  (at  least  when  n  is 
large).  In  fact,  if  pn  scales  according  to  a  large  deviation  principle  then  K 
must  grow  exponentially  in  n  if  a  bounded  relative  error  is  to  be  maintained. 
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However,  under  the  assumed  condition  on  the  moment  generating  func¬ 
tion  Sn/n  satisfies  what  is  called  a  large  deviation  principle  (LDP).  Define 
the  convex  function 

H(a)  =  log  /  exp(a,  y)y(dy),  a  6  M.d,  (1.1) 

and  its  Legendre  transform 

L(/3)  =  sup  [(a,  (3)  -  H(a )]  ,  (3  6  Md.  (1.2) 


It  is  well  known  that  L  is  a  nonnegative,  proper,  strictly  convex  function 
[30],  [13,  Lemma  6.2.3],  and  that  L((3)  =  0  if  and  only  if  (3  =  fRdyy(dy). 
Suppose  the  set  A  has  the  property  that 

inf  L((3)  =  inf  L{(3),  (1.3) 

P&A°  v  '  /3e A 

where  A°  and  A  denote  the  interior  and  closure  of  A.  respectively.  Then 
[39]  we  have  the  large  deviation  approximation 

lim  —  log  P  {Sn/n  €  A}  =  —  inf  L((3). 

n— >oo  n 

Now  the  distribution  of  X  is  rather  complicated,  and  so  it  makes  sense  to 
consider  the  change  of  measure  with  respect  to  the  underlying  distribution 
y  of  the  Yi  instead.  It  is  at  this  point  that  the  large  deviation  theory 
suggests  a  specific  alternative  sampling  distribution.  To  distinguish  indices, 
we  henceforth  reserve  k  to  index  the  A:th  simulation  in  the  Monte  Carlo 
scheme,  and  i  and  j  for  the  index  in  the  sum  that  defines  Sn.  In  addition,  k 
will  appear  as  a  superscript,  and  i  and  j  as  subscripts.  Suppose  that  instead 
generating  Y-*  according  to  y,  we  generate  iid  sequences  { Y according  to 
a  distribution  v.  Suppose  that  we  further  restrict  to  measures  v  that  are 
related  to  y  by  an  exponential  tilt,  i.e.,  there  is  a  €  such  that 

v[dy)  =  e^y(dy)/Z{a),  Z(a )  =  eH^  =  \  e^y(dy). 

md 


We  then  form  S^/n  =  (Yq  +  •  •  •  +  Y^_^)/n,  so  that  1^*, /u£a}  plays  the  role 
of  Xk  above.  It  is  convenient  to  first  express  the  Radon-Nikodym  derivative 
in  terms  of  the  iid  (in  j )  variables  Y^  (rather  than  the  complicated  variable 
Sn/n),  and  so  we  arrive  at  the  estimator 


1 

K 


K—l 

12  l{S*/neA} 
k= 0 


n—  1 

n  ^e-{^)+H{a) 

3= 0 


1 

K 


E  1  {S^A}e<-{a^/n)+H{a)2 

k= 0 
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It  is  not  difficult  to  verify  that  this  estimator  is  unbiased. 

Recall  that  the  rate  of  convergence  of  the  estimator  is  determined  by  the 
variance  of  the  summands 


var 


1{S£/neA}e 


We  temporarily  drop  k  from  the  notation,  and  observe  that  to  minimize  the 
variance  (over  a)  it  suffices  to  minimize  the  second  moment 


E 


1  _  2  n(—(a,Sn/n)+H(a)) 

l{Sn/neA}e 


After  calculating  the  moment  generating  function  of  v  and  computing  its 
Legendre  transform,  one  can  verify  via  Cramer’s  Theorem  that  Sn/n  also 
satisfies  a  LDP,  but  with  L  replaced  by  L(/3)  =  L(f3 )  +  H(a)  —  (a,  (3).  A 
formal  application  of  Varadhan’s  Theorem  on  the  asymptotic  evaluation  of 
integrals  [13,  Theorem  1.3.4]  then  gives  the  large  n  approximation 


—  log  E 
n 


1  -  2  n(-(a,S„/n)+H(a)) 

i{5„/neWe 


-  jnf  [2 (a,  (3)  -  2 H (a)  +  L(f3)\ 


We  now  substitute  the  expression  for  L  and  optimize  over  a  to  obtain  the 
min/max  problem 


sup  inf  [(a,/3)  —  H(a )  +  L((3)]  . 

According  to  the  previous  discussion,  the  suprenrizing  a  should  identify  an 
asymptotically  optimal  sampling  distribution  among  all  changes  of  measure 
of  the  prescribed  form.  If  the  conditions  of  the  min/max  theorem  hold  (e.g., 
if  A  is  convex  and  bounded),  then  the  sup  and  inf  can  be  permuted,  and  we 
find  that 

sup  inf  [(a,  (3)  —  H(a)  +  L(/3)]  =  inf  (  sup  [(a,  /3)  —  H(a)]  +  L(/3)  ] 

Oi£S.dPeA  PeA  VaelD  J 

=  inf  2 L(/3). 

0eA 

Furthermore,  if  (3*  is  a  solution  to  the  problem  inf 0£aL(/3),  then  a  suprern- 
izing  a*  can  be  identified  as  the  conjugate  dual  of  /3*,  i.e.,  as  a  point  that 
maximizes  [(a,  (3*)  —  H(a)]  (assuming  that  this  maximum  exists). 

The  identification  of  the  optimizing  a  brings  in  a  further  interesting 
connection  with  the  theory  of  large  deviations.  In  the  traditional  proof  of 
Cramer’s  Theorem  (as  well  as  many  other  large  deviation  results),  the  lower 
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bound  is  proved  by  a  change  of  measure  argument,  and  the  upper  bound 
by  a  somewhat  involved  application  of  Chebyshev’s  inequality  [39].  Our 
interest  is  in  the  lower  bound.  Suppose  that  the  rate  function  L  has  been 
guessed  (or  that  it  has  been  suggested  by  an  existing  upper  bound),  and 
that  one  wishes  to  estimate  P  {Sn/n  £  A}  from  below.  Then  one  considers 
all  exponential  changes  of  measure  on  the  underlying  distribution  /i  which 
shift  the  mean  to  (3*  £  A.  By  estimating  the  asymptotically  dominant  (as 
n  — ►  oo)  term  in  the  Radon-Nikodym  derivative,  one  can  find  a  lower  bound 
for  each  such  change  of  measure,  and  then  optimize  to  obtain  the  tightest 
lower  bound.  Although  we  will  not  repeat  the  details  of  the  proof  here,  it 
turns  out  that  the  correct  change  of  measure  is  also  characterized  as  the 
exponential  tilt  associated  with  a  supremizing  (or  nearly  supremizing)  point 
ina->  [(a,  (3*)  —  H{a)\. 

Thus  the  asymptotically  optimal  change  of  measure  that  is  used  to  prove 
the  large  deviation  lower  bound  formally  coincides  with  the  distribution 
suggested  by  the  preceding  argument  for  use  in  importance  sampling.  This 
formal  connection  has  been  made  rigorous  in  certain  circumstances,  and 
consequently  given  rise  to  the  following  heuristic:  the  change  of  measure 
used  to  prove  the  large  deviation  lower  bound  should  be  a  good  (perhaps 
nearly  optimal)  distribution  to  use  for  purposes  of  importance  sampling.  The 
first  result  of  this  type  was  given  by  Siegmund  [37].  The  basic  idea  was 
subsequently  investigated  in  many  contexts,  and  a  small  selection  of  the 
literally  hundreds  of  papers  on  the  topics  is  [1,  2,  3,  7,  8,  9,  11,  12,  14,  23, 
25,  28,  29,  31,  32,  35].  Necessary  and  sufficient  conditions  under  which  a 
prescribed  scheme  is  asymptotically  optimal  are  discussed  in  [10,  33,  34], 
while  [26]  gives  a  survey  of  rare-event  simulation. 

However,  more  recent  work  has  challenged  the  validity  of  the  heuristic. 
For  example,  in  [24]  it  is  asserted  that  if  one  uses  the  change  of  measure 
suggested  by  large  deviations,  then  in  certain  situations  the  corresponding 
importance  sampling  scheme  has  very  poor  properties.  Examples  are  given 
to  show  that  it  can  even  perform  worse  than  the  standard  Monte  Carlo 
method!  A  simple  example  given  in  [24]  considers  Cramer’s  Theorem  in  one 
dimension,  with  A  =  (— oo,a]  U  [b,  oo),  and  EYq  £  (a,  b).  Note  that  the  set 
A  here  is  not  convex,  and  so  the  application  of  the  min/max  theorem  above 
is  no  longer  valid.  Under  these  circumstances  infg6j4  L((3)  must  be  achieved 
at  a  or  b,  and  conditions  are  used  to  imply  0  <  L(b)  <  L(a),  so  that  the 
optimal  change  of  measure  in  the  large  deviation  lower  bound  is  one  that 
centers  the  mean  under  the  new  distribution  on  b.  When  one  simulates  un¬ 
der  this  new  distribution  the  overwhelming  majority  of  the  sample  means 
end  close  to  b  (for  large  n).  There  are,  however,  occasional  “rogue”  sample 
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means  Sk/n  that  end  in  the  set  (oo,  o],  and  the  indicator  l(_00ja](S'^/n), 
appropriately  multiplied  by  the  Radon-Nikodym  derivative,  appears  in  the 
estimator.  Unfortunately,  it  turns  out  that  these  Radon-Nikodym  deriva¬ 
tives  are  very  large,  and  in  fact  large  enough  that  even  though  the  outcome 
S^/n  6  (—00,  a]  is  very  unlikely  (exponentially  small  with  exponent  propor¬ 
tional  to  n),  the  term 


l{5^/ne(— oo,a]} 


en(-{a,S*/n)+H(a)) 


dominates  the  variance  as  n  — >  00.  Rrdeed,  the  second  moment  of  this 
term  alone  could  be  (exponentially)  larger  than  e~2nL(b)  when  n  — »•  00.  For 
example  (see  [24]  for  more  details),  take  /1  ~  N(  1, 1)  and  —a  =  b  >  1.  We 
have  H(a)  =  a  +  a2/ 2,  L(/3)  =  {(3  —  l)2/2,  mfggj4  L(/3)  =  L(b)  <  L(a).  The 
change  of  measure  will  shift  the  mean  to  b,  or  equivalently  a  =  a*  =  b  —  1. 
It  is  not  difficult  to  verify  that 


—  —  log  E 


n 


{S^/n€{- 00, a]} 


,2  n(-{a,S*/n)+H{ot)) 


-  inf  [(a*,P)  -H(a*)  +  L(J3)] 

p<a 

=  —b2  +  26  +  1. 


This  term  is  smaller  than  2 L(b)  =  b2  —  2b  +  1  if  b  >  2. 

In  spite  of  the  examples  of  [24],  one  must  examine  the  claims  carefully 
before  concluding  that  large  deviations  has  little  to  say  in  such  situations. 
The  key  issue,  from  our  point  of  view,  is  our  somewhat  misleading  use  of 
the  word  “the”  in  the  phrase  11  the  change  of  measure  used  to  prove  the 
large  deviation  lower  bound” .  It  turns  out  that  there  are  many  changes  of 
measure  that  can  be  used  to  prove  the  lower  bound,  and  one  must  consider 
this  larger  class  if  one  hopes  to  identify  importance  sampling  schemes  that 
work  well  in  great  generality.  Observe  that  the  change  of  measure  used  in 
the  large  deviation  lower  bound  treats  all  summands  similarly,  in  that  the 
same  shift  of  distribution  from  /i  to  v  is  used  for  all  T)fc.  However,  one  could 
also  consider  shifts  of  the  underlying  distribution  /i  that  dynamically  adapt, 
in  that  the  measure  used  to  generate  Yf  could  depend  on  the  outcomes 
Yjk,j  =  0, . . .  ,i  —  1.  This  distinction  corresponds,  in  control  terminology,  to 
the  difference  between  “open-loop”  and  “feedback”  controls.  The  early  pa¬ 
per  [12]  utilizes  an  adaptive  importance  sampling  scheme  to  estimate  certain 
escape  probabilities  for  a  Markov  chain,  and  proves  asymptotic  optimality 
for  a  particular  one-dimensional  problem.  However,  the  techniques  used 
are  not  broadly  applicable.  The  paper  [14]  uses  adaptive  importance  sam¬ 
pling  to  estimate  functionals  of  a  small  noise  diffusion,  though  no  proofs  of 
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optimality  are  given.  The  distinction  between  these  two  methods  of  impor¬ 
tance  sampling  was  articulated  in  Sadowsky  [33],  where  he  refers  to  adaptive 
schemes  as  “sequential.” 

There  is  an  alternative  approach  to  large  deviations  that  is  based  on 
stochastic  control  methodology  in  which  both  upper  and  lower  bounds  are 
proved  simultaneously.  The  use  of  stochastic  control  and  logarithmic  trans¬ 
forms  to  prove  large  deviation  type  results  goes  back  to  Fleming  [20],  and  has 
been  investigated  in  many  different  situations  since  then  [13,  15,  18,  22,  36]. 
In  this  approach,  the  quantity  of  interest  [e.g.,  —  (log P{Sn/n  £  A})/n  in 
Cramer’s  Theorem]  is  represented  as  the  minimal  cost  Un  for  a  stochastic 
control  problem.  The  large  deviation  rate  also  has  a  representation  as  the 
minimal  cost  U  for  a  control  problem,  though  in  the  case  of  Cramer’s  The¬ 
orem  it  is  a  deterministic  control  problem.  The  large  deviation  asymptotics 
then  correspond  to  the  convergence  of  minimal  costs:  Un  — *  U . 

Because  the  limit  control  problem  that  defines  U  is  deterministic  (at 
least  in  the  case  of  Cramer’s  Theorem),  the  use  of  open  loop  controls  as 
“comparison  controls”  is  acceptable  for  the  purpose  of  proving  Un  — >  U.  It  is 
for  this  reason  that  simple  changes  of  measure  can  be  used  to  prove  the  large 
deviation  lower  bound  in  the  traditional  change  of  measure  argument.  In 
other  words,  one  can  always  find  a  “nearly  optimal”  control  from  among  the 
class  of  open  loop  controls  as  n  — ►  oo.  However,  in  analyzing  the  optimality 
of  importance  sampling  schemes  we  must  study  the  asymptotics  of  a  small 
noise  stochastic  game ,  rather  than  a  control  problem.  The  connection  with 
the  game  is  introduced  in  the  next  section,  and  further  discussed  in  Sections 
3.1  and  3.2.  In  the  setting  of  stochastic  games,  and  even  for  small  noise 
stochastic  games,  open  loop  controls  are  not  “nearly  optimal”  except  in 
special  circumstances.  As  a  consequence,  to  come  close  to  optimality  in 
a  general  situation  one  must  consider  importance  sampling  schemes  that 
adapt  the  sampling  distribution  in  the  course  of  simulating  each  trajectory 
indexed  by  k. 

The  paper  is  organized  as  follows.  In  Section  2  we  give  the  defini¬ 
tion  of  asymptotic  optimality,  and  show  that  adaptive  importance  sampling 
schemes  designed  to  minimize  the  second  moment  are  asymptotically  opti¬ 
mal.  Section  3  discusses  an  alternative  formal  PDE  approach  to  the  adaptive 
scheme,  and  describes  a  method  for  the  construction  of  a  single  asymptot¬ 
ically  optimal  adaptive  scheme.  Two  numerical  examples  are  also  included 
in  Section  3.3.  Certain  technical  proofs  are  deferred  to  the  Appendix  to  ease 
exposition. 


2  Statement  and  Proof  of  the  Main  Result 

Consider  a  probability  space  (£l,E,P)  and  a  family  of  events  {An}  with 

lirn  -log P{An}  =  -7, 

n— >00  77, 

for  some  7  >  0.  A  general  formulation  of  importance  sampling  for  this 
problem  can  be  described  as  follows.  In  order  to  estimate  P{An},  a  generic 
random  variable  Zn  is  constructed  such  that  P{An}  =  EZn.  Independent 
replications  Zff^1)  of  Zn  are  then  generated,  and  we  obtain  an 

estimator  by  averaging: 

f\K  _j_  +  %h  d  b  1 

Un  ~  R 

The  estimator  is  unbiased,  i.e.,  EQ%  =  P{An}.  The  rate  of  convergence 
associated  with  this  estimator  is  determined  by  the  variance  of  the  sum¬ 
mands,  or  equivalently,  their  second  moment  E[(Z „)2].  The  smaller  the 
second  moment,  the  faster  the  convergence,  whence  the  smaller  sample  size 
K  required.  However,  it  follows  from  Jensen’s  inequality  that 

limsup - log E[{Z^)2]  <  lim - log  (ez£\  =27. 

n — >00  Tl  n  71  ^  ' 

The  estimator  Q %  is  said  to  be  asymptotically  optimal  if  the  upper  bound 
is  achieved,  i.e.,  if 

=  (2,1) 

In  order  to  illustrate  the  main  idea  we  will  return,  for  the  remainder  of 
the  paper,  to  the  setup  of  Cramer’s  theorem. 

Remark  2.1  Since  the  performance  of  the  estimator  Q ^  is  completely  de¬ 
termined  by  the  second  moment  of  its  generic,  iid  building  block  Z^.  we  will 
drop  the  superscript  k  hereafter.  Note  that  n  does  not  stand  for  sample  size. 
Also,  unless  noted  otherwise,  the  integral  sign  will  always  denote  an  integral 
over  Md. 

Let  a  probability  measure  //  on  be  given,  and  define  H  and  L  by 
equations  (1.1)  and  (1.2),  respectively.  Let  A  C  Rd  be  given.  We  wish  to 
estimate  the  probability  pn  =  P{Sn/n  £  A},  and  make  use  of  the  following 
assumption. 

Condition  2.1  H(a)  <  00  for  all  a  £  Md,  and  equation  (1.3)  holds. 
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Under  this  assumption, 


lim  ——log pn  =  inf  L(f3 )  =  inf  L(/3)  =  inf  L(f3).  (2.2) 

n—too  n  0eA°  y  j  0eA  v  '  I3eA  v  ' 


We  consider  a  family  of  changes  of  measure  which  are  related  to  0, 
by  an  exponential  tilt.  To  this  end,  we  introduce  the  following  control 
problem.  Let  a  collection  of  Borel  measurable  functions  (controls)  an  = 
(')> 3  =  0, 1, . . . ,  n  —  1  j  be  given.  Then  the  state  dynamics  are  governed 
by 

j-1  _ 

Sj  =  E  J  =  0, 1, . . . ,  n  —  1, 

i= 0 

where  YJ1  is  conditionally  distributed,  given  { Y™,i  =  0, 1, . . . ,  j  —  1},  ac¬ 
cording  to 

Vj(dy)  =  exp  | /n),yj  -  H  (a”(5"/n))|  n{dy). 


In  other  words,  the  sampling  distribution  is  allowed  to  depend  on  the  his¬ 
torical  empirical  mean.  The  estimator  of  pn  =  P{Sn/n  G  A}  is  defined  as 
the  average  of  independent  replications  of 


Xn  = 


I  r  _  ,  eE”=c \{-{^^/n),Y?)+H^{S-/n))) 

1{5™/ne^}e 


Our  goal  is  to  minimize  the  second  moment,  hence  the  variance,  of  the 
summands  Xn  by  judiciously  choosing  the  control  an.  Thus  we  consider  the 
value  function  defined  by 


Vn  =  inf  E[Xl\  =  inf  E 


l{SZ/nEA}e 


X)”=01(-2(^(5jl/n),y/)+2if(c 


q(s*/n))) 


We  also  consider  the  log  transform 

Wn  =  ——log  Vn.  (2.3) 

n 

The  following  result  asserts  the  asymptotic  optimality  (or  near  asymp¬ 
totic  optimality)  of  the  dynamic  change  of  measure  defined  by  any  minimiz¬ 
ing  sequence  of  controls  (respectively,  nearly  minimizing  sequence  of  con¬ 
trols)  in  the  definition  of  Vn.  Of  course,  one  would  like  to  define  a  change  of 
measure  that  does  not  explicitly  depend  on  n.  Such  a  change  of  measure  is 
naturally  suggested  by  the  deterministic  control  problem  that  characterizes 
the  limit  of  the  Vn,  and  we  will  remark  further  on  this  in  Section  3.2. 
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Theorem  2.1  Assume  Condition  2.1,  and  define  Wn  by  (2.3).  Then 

lim  Wn  =  2  inf  L(3). 

n—>o o  /3eA 


The  proof  of  the  theorem  calls  for  the  following  min/max  lemma,  which 
is  of  independent  interest.  Recall  that  for  a  probability  distribution  7  on 
Md,  the  relative  entropy  R( 7||/x)  is  defined  by 

=  /  lo 

dp, 

if  7  <C  /U,  and  J?(7||/x)  =  00  if  otherwise.  Let 

C  =  1 7  :  7  is  a  probability  distribution  on  Md,  and  ^(7 1|  m)  <  00 1 . 

The  convexity  of  relative  entropy  in  the  first  argument  [13,  Lemma  1.4.3] 
implies  that  C  is  a  convex  set. 


Lemma  2.2  Suppose  that  f  :  — >  M  is  a  bounded  measurable  function. 

Then 


sup  inf 

a£Rd 7 eC 


J  f(yh(dy)  +  R(t IIaO  +  J («,  y)i{dy)  -  H(a ) 


=  inf  sup 

7GC  QeMd 


j  f(yh(dy)  +  RfiyWn)  +  J (a,y)i{dy)  -  H(a ) 


The  proof  of  this  lemma  is  deferred  to  the  Appendix. 


Proof  of  Theorem  2.1.  The  unbiasedness  of  the  estimator  associated  with 
Xn  and  Jensen’s  inequality  imply  that 

Vn  >  inf  (EXnf  =  Mp2n=1t 

an  an 

Equation  (2.2)  and  the  last  display  then  give 

lim  sup  Wn  <  lim  — —  logp^  =  2  inf  L((3). 

“  n-+ 00  n  0&A 

It  remains  to  show  the  reverse  inequality 


lim  inf  IT"  >  2  inf  L(3). 

n^o o  0£A 
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We  will  use  the  weak  convergence  approach  as  developed  in  [13].  To 
analyze  the  asymptotics  of  Wn,  we  represent  it  as  the  value  function  for  a 
stochastic  control  problem.  To  this  end,  we  first  extend  the  state  dynamics 
so  as  to  write  down  a  dynamic  programming  equation  (DPE).  The  “state 
variable”  will  be  the  normalized  quantity  Sf  /n.  Abusing  notation  a  bit,  for 
x  €  Rd,  i  G  {0, . . . ,  n},  define  the  state  dynamics 

i-1  _ 

Sij  =  nx  +  J2  Yi‘b  j  = 

l=i 


Here  YY  is  conditionally  distributed,  given  {Y?g,£  =  i, . . . ,  j  —  1},  according 
to 

verity)  =  exp  {(a](SY/n),y)  -  H  ^(^/n))}  n(dy). 
Similarly,  define 


Vn(x,i )  =  inf  E 


1  f  .  eE;=i1(-2<-”(^/n),T"}+2if(a«,(5l:,/n))) 


and 


{SIJneAY 


Wn(x,i )  = - log  Vn(x,i). 

n 


(2.4) 

(2.5) 


Note  that  the  original  case  corresponds  to  x  =  0  and  1  =  0,  that  is 


vn  =  vn(o,  o),  r  =  r(  o,o),  s^  =  s^n 

Also  observe  the  terminal  conditions 


Vn(x,n )  =  1  a(x),  Wn(x,n )  =  oo  •  1a(x)- 

Since  it  is  inconvenient  to  study  the  problem  with  an  oo  terminal  con¬ 
dition,  we  instead  work  with  a  mollified  version  of  the  control  problem.  Let 
F  :  ->  R  be  an  arbitrary  bounded  and  continuous  function.  Suppose 

that  Vp  is  defined  as  in  (2.4),  save  that  the  indicator  function  lr^„  /neA\ 

is  replaced  by  exp{— 2nF(S™n/n)}.  Similarly  define 

Wp(x,i)  =  -  -log  VP(x,i). 
n 


It  is  not  difficult  to  see  that 

\W2(x,i)\  <  2HFIU,  e-2-HFll-  <  vp(x,i)  <  e2n^ 


(2.6) 
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Indeed,  the  first  inequality  is  implied  by  the  second  inequality,  while  the 
latter  is  implied  by 


inf  E 

an 


=  l. 


To  see  that  this  is  true,  first  consider  the  case  an  =  0.  Since  H( 0)  =  0,  the 
right  hand  side  is  bounded  below  by  the  left  hand  side.  Next  consider  any 
control  sequence  an.  By  Jensen’s  inequality 


E 


>  E 
=  1. 


E;=i1(-(«”(^/n),T">+ff(aW(5P./n))) 


n2 


6^3 


Thus  the  left  hand  side  is  bounded  below  by  the  right  hand  side. 

It  follows  from  [5]  that  Vp  satisfies  the  Bellman  equation 

VP(x,  i)  =  inf  J  e-^,y)+2H(a)yn  ^  +  i  +  ^  e(«,v)-H(a) 

=  inf  J  e-{<*,y)+H{*)vn  ^  +  ly,  Z  +  l)  n(dy), 

together  with  the  terminal  condition  Vp(x,n)  =  exp{— 2nF(x)}.  It  follows 
that 

Wp(x,i)  =  —  —  log inf  [  e-{a’y)+H{a)e-nW2(x+^y’i+1)y{dy), 
n  a  jRd 

and  that  Wp(x,n)  =  2 F(x).  The  functions  Vp  and  Wp  are  both  Borel- 
measurable  (see  the  Appendix  for  the  proof). 

The  relative  entropy  representation  for  exponential  integrals  [13,  Lemma 
1.4.2]  states  that 


—  log  J  e  Ex)  jj  ^dx)  =  inf 


^(7  IIaO  + 


for  all  bounded  measurable  functions  /.  A  direct  application  of  this  repre¬ 
sentation  to  Wp  is  not  valid  since  the  function  y  —*■  (a,  y)  is  not  bounded. 
Nonetheless,  an  extension  to  cover  the  types  of  unbounded  functions  we 
must  consider  is  possible,  and  a  proof  of  this  fact  is  given  as  Lemma  4.1  in 
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the  Appendix.  Applying  this  extended  version  of  the  representation  gives 

Wp(x,  i)  =  sup  inf  [  Wp  (x  +  -y,  i  +  l)  i(dy) 
aeRd7  eC[J  V  n  J 

+  ~  (-^(7  IIm)  +  /  (a,y) -f(dy)  —  H(a)^j  .  (2.7) 

The  last  display  shows  that  Wp  has  an  interpretation  as  the  value  function 
for  a  small  noise  stochastic  game.  We  will  return  to  this  point  in  Subsection 
3.1.  The  functional  on  the  right-hand  side  is  concave  with  respect  to  a  and 
convex  with  respect  to  7,  and  the  sets  and  C  are  both  convex.  Since 
the  set  C  is  in  general  non-compact,  the  min/max  theorem  [38]  cannot  be 
applied  directly.  However,  Lemma  2.2  shows  that  the  interchange  of  inf  and 
sup  is  still  valid  in  this  case.  In  other  words, 

Wp(x,  i )  =  inf  sup  [  Wp  (x  +  -y,  i  +  l)  7 (dy) 

7  eCa&tdU  V  n  J 

+  ^  (r{i  11/0  +  J  (“» y)  7 (dy)  -  H(a)j 

=  7gc  Jwf{x  +  ~y^ i  +  x)  i(dy) 

+  n  +L  (/  s/tC^z/)))  •  (2-8) 

Equation  (2.8)  implies  that  Wp(x,i)  also  has  an  interpretation  as  the  mini¬ 
mal  cost  of  a  stochastic  control  problem  of  the  same  general  form  as  before. 
To  simplify  the  notation,  we  state  the  control  problem  only  for  the  case 
i  =  0.  The  control  problem  will  be  defined  on  a  probability  space  (H,  jF,  P), 
and  Ex  will  denote  that  the  initial  condition  of  the  state  process  is  x.  An 
admissible  control  is  a  sequence  {7”,  j  =  0, 1, . . . ,  n  —  1},  with  each  7"  be  a 
stochastic  kernel  on  given  Rd.  Given  an  admissible  control  sequence,  the 
state  dynamics  are  defined  by  Sq  =  nx  and 

on  _j_  on  1  \rn 

Dt+1  —  Dj  '  1 j  ’ 

where 

P  {YJ1  edy\Y?,0<i<  j}  =  p  {y?  G  dy  \  S™/n)  =  ^(dy  \  S?/n). 


We  then  define  value  function 
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where  the  infimum  is  over  all  controls  {7"}  and  resulting  controlled  processes 
{Sj'/n}  that  start  at  x  at  time  0.  Since  Vp  also  satisfies  the  DPE  (2.8)  [5, 
Chapter  8]  and  terminal  condition  Vp(x,n)  =  Wp(x,n )  =  2 F(x),  we  obtain 
by  induction  that  Wp(x,  i )  =  Vp(x,  i )  for  all  x  G  and  i  G  {0, . . . ,  n}. 
Define  a  stochastic  kernel  7”  on  M.d  given  [0, 1]  by 

_,n/ ilD  =  /  if  t  e\j/n,(j +  l)/n),  j  =  0,l,...,n-2 

\  In-iidy)  if  t€[(n-  1  )/n,  1] 

and  (abusing  notation  a  bit)  define  the  process  Sn  =  {Sn(t),  t  €  [0, 1]}  as  the 
piecewise  linear  interpolation  of  the  controlled  sequence  {S'”}.  Let  A  denote 
Lebesgue  measure.  Then  the  definition  of  7n(dy|f)  and  the  convexity  of  L 
imply  that 

W$[x,  0) 

=  Vp(x,  0) 


=  inf '  eJrWWwXI+lI  f  [  yin(dyxdt))+2F(S?Jn)}, 
{7"}  L  vo  Jmd  J  \ 


where  7 n{dy  x  dt)  =  ^n(dy\t)dt  and  (ii®\)(dy  x  dt )  =  y(dy)dt.  A  straight¬ 
forward  weak  convergence  approach  will  be  adopted  to  derive  the  desired 
inequality  (2.9)  below.  Since  the  proof  is  essentially  the  same  as  [13,  Theo¬ 
rem  5.3.5],  we  only  give  a  sketch. 

For  each  e  >  0,  there  exist  a  sequence  of  controls  {7n,n  G  N}  such  that 

W£(x,0)+£>EX  LR(7n||M®  A)  +  l(  f1  [  V1n{dy  X  dt))  +  2F(S”/n) 

Vo  JRd  ) 

for  every  n.  Furthermore,  since  L  is  non-negative  and  F  is  bounded,  {7”} 
is  indeed  uniformly  integrable  in  the  sense  of  [13,  Proposition  5.3.2],  For 
any  subsequence  of  {7n,n  G  N},  we  can  extract  a  weakly  convergent  sub¬ 
subsequence,  still  denoted  by  {7”},  such  that  7”  =>  7  for  some  stochastic 
kernel  7  whose  second  marginal  is  Lebesgue  measure.  We  utilize  the  Sko- 
rokhod  representation  [6],  which  allows  us  to  assume  (when  calculating  the 
limits  of  the  integrals)  that  the  convergence  is  actually  w.p.l.  It  follows 
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from  the  lower  semicontinuity  of  R,  the  convergence  yn  =>  7,  and  the  uni¬ 
form  integrability  that  w.p.l 

lirninf  R  (7n||/x  <g>  A)  >  R( 7||/x  <g>  A), 

linr  [  [  y-)n{dy  x  dt)  =  [  f  y^y(dy  x  dt), 
n  Jo  J«.d  Jo  J] Rd 

and 

Sn/n  —  Z  — >  0,  where  Z(t)  =  x  +  j  [  y'y(dy\s)ds, 

Jo  Js.d 

where  7(dy  x  ds)  =  7(dy|s)ds.  The  convergence  in  the  last  display  is  with 
respect  to  the  supremum  norm.  In  addition,  the  convexity  of  R(v\\fi)  in  v 
and  Jensen’s  inequality  imply 

#(7ll/i®  A)  =  R('y(-\t)\\n(-))dt  >  R  ^  7(-|t)dt||/x(-)^  - 

The  uniform  integrability  of  [13,  Theorem  5.3.5]  also  implies  Jq  JRd  ||y||7(dyx 
dt)  <  00  w.p.l,  and  hence 

[  [  .yi{dy\t)dt  =  [  [  yj(dy\t)dt. 

Jo  J Rd  J Md  Jo 

By  Fatou’s  Lemma  and  the  lower-semicontinuity  of  L  [30],  we  have 
linr  inf  W%(x,  0)  +  £ 

n 

>  ex  R(^Jo  tOIO^IIM-))  +l  (yJRdJ0  yi(dy\t)dt^  +2F(z(i)) 

Using  the  identity 

inf  R(v\\n)\  j  yv{dy)  =  fJ  =  L{(3) 

[13,  Lemma  3.3.3],  it  is  elementary  to  show  that  the  right  hand  side  of  the 
last  inequality  is  bounded  below  by 

2  inf  [L(J3)  +  F{x  +  (3)]  . 
peRd 

Since  e  >  0  is  arbitrary, 

lirninf  Wp(x,  0)  >  2  inf  [L(/3)  +  F(x  +  /?)], 
n— >0°  /3eMd 
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and  in  particular 


lirninf  W2(0,0)  >  2  inf  [L(0)  +  F(/3) }  .  (2.9) 

rl— >°°  0£Rd 


Now  let  Fj(y)  =  j(d(y,A )  A  1).  Since  1  a{v)  <  exp{-2nFJ(y)}, 
lim  inf  Wn  =  lim  inf  Wn  (0,0) 

n — >oo  n — >oo 

>  lim  inf  W2.(0,  0) 

>  2  inf  [!/(/?)  +  -Fj(/?)] . 

/3eMd 

Exactly  as  on  [13,  pages  10-11],  a  compactness  argument  shows  that 
lim  inf  {L(/3)  +  Fj(f3)}  =  inf  L(/3), 

>oo 

and  we  complete  the  proof.  ■ 


For  a  scalar  a  let  [aj  denote  the  integer  part  of  a. 

Proposition  2.1  Assume  Condition  2.1,  and  define  Wn(x,i)  as  in  (2.5), 
except  in  (1.3)  we  replace  A  with  (A  —  x)/{l  —  t).  Then 

lim  Wn(x,  [tn\ )  — ►  2 U(x,t), 

n — xx) 

where 

U(x,  t)  =  inf{(l  —  t)L(P)  :  x  +  (1  —  t)/3  G  A}. 

Proof.  The  proof  is  essentially  the  same  and  thus  omitted.  ■ 

Remark  2.2  The  proof  of  Theorem  2.1  actually  implies  a  more  general 
result.  If  we  write  Wp  =  Wp{ 0,  0),  then 

bW  =  2  inf  [L(/3)  +  F(/3)] 

?wo°  /3eMd 

for  all  bounded  and  continuous  functions  F  :  — ►  M.  Indeed,  since  (2.9) 

gives  the  reverse  inequality,  all  we  need  to  show  is  that 

lim  sup  Wp  <  2  inf  [L(/3)  +  F(f5)}. 

n  /3eMd 

However,  by  definition 

!T£  =  —  —  log  Vp , 
n 
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where  Vp  =  Vp  (0, 0)  is  defined  by 

Vp  =  inf  E  [(Xn,Ff 

with 

Xnp  =  e-nF(Sn/n)eE^U-{a7^7/n^r)+H(^(Spn)))_ 

For  each  control  an,  it  is  not  difficulty  to  verify  that  the  resulting  XnF  is  an 
unbiased  estimator  for  E’fexpj—  nF(Sn/n)}].  It  then  follows  from  Jensen’s 
inequality  that 

W$  <  —  —  log  inf  E  \(XnF )21  =  -  -log  Ee~nF{Sn/n\ 
n  an  L  ’  J  n 

Thanks  to  the  large  deviations  properties  of  Sn/n,  this  implies  [13,  Theorem 

1.2.1] 

lim  sup  Wp  <  limsup--log  Ee~nF^/n)  <  2  inf  [L(J3)  +  F(J3)]. 

n  n  Tl  /3eRd 

Similar  to  Proposition  2.1,  we  have  the  following  more  general  version, 
lim Wp(x,  \  tn\)  =  2 UF(x,t) 

where 

UF(x,t)=  inf  {(1  -  t)L(J3)  +  F{x  +  (1  -  t)/3)}.  (2.10) 

(3e  Rd 

3  Examples  and  Further  Remarks 

3.1  The  formal  PDE  approach  and  the  limit  control  problem 

Although  Theorem  2.1  establishes  that  there  are  asymptotically  optimal  [in 
the  sense  of  equation  (2.1)]  adaptive  importance  sampling  schemes,  it  does 
not  explicitly  discuss  the  construction  of  such  schemes,  or  even  schemes  that 
are  approximately  asymptotically  optimal. 

In  this  subsection,  we  will  discuss  an  alternative  PDE  approach  to  study 
the  adaptive  importance  sampling  scheme.  The  approach,  though  largely 
formal,  will  shed  light  on  how  to  construct  asymptotically  optimal  and  nearly 
optimal  controls  an.  Some  numerical  confirmation  of  this  approach  is  pre¬ 
sented  in  Section  3.3. 

For  a  general  measurable  function  F,  and  assuming  the  limit  exists, 
define 

WF(x,t)  =  lim  Wp(x,  l tn\ ). 
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Remark  2.2  asserts  that 


WF(x,t)  =  2  UF(x,t) 


(3.1) 


for  all  continuous  and  bounded  functions  F.  while  Proposition  2.1  confirms 
the  validity  of  (3.1)  for  F  =  00I4.  We  will  re-derive  equation  (3.1)  below 
by  formally  arguing  that  WF  and  2 UF  satisfy  the  same  PDE. 

To  obtain  this  PDE,  we  rewrite  (2.7)  as 

0  =  sup  inf  f  AWp(y)'y(dy )  +  -  (r{ tIIm)  +  [ (a,y)j(dy)  -  R(a))  , 
aeRd  jeC  L J  n\  J  J  \ 


where 

A W$(y)  =  WF  (x  +±y,i  +  ij-  W$(x, i). 

Suppose  that  a  t  subscript  denotes  the  partial  derivative  with  respect  to 
t,  and  that  an  x  subscript  denotes  the  vector  of  partials  with  respect  to 
Xi,  i  =  1 , ,d.  Formally,  we  have  the  approximation 

A  W$*s-(WF)t  +  -{(WF)x,y). 
n  n 

Inserting  this  into  the  DPE  leads  to 


0  =  sup  inf 

aeMd  76 c 


( WF)t  + 


y)  l{dy)  +  R(r/\\n)  +  /  (a,  y)l{dy )  -  H(a ) 


We  have  already  noted  that 


inf 


R( 7ll/^) :  J vi{dy) 


p 


m- 


Therefore, 


0  =  (WF)t  +  sup  inf  [((WF)X,  P)  +  L(P)  +  (a,  P)  -  H(a)] . 

ae]Rd  (3€Rd 

Such  an  equation  is  called  an  Isaac’s  equation,  and  it  is  well  known  that 
such  equations  are  satisfied  by  the  value  function  for  a  differential  game. 
We  will  not  give  the  rather  lengthy  and  detailed  formal  definition  of  the 
game  here,  but  simply  note  its  general  features.  Because  of  the  intervening 
minus  sign,  the  maximizing  player  is  actually  trying  to  minimize  the  variance 
through  the  choice  of  control  a(t)  6  Rd.  The  minimizing  player  appears  due 
to  the  large  deviations  approximation  of  the  variance,  and  chooses  p{t)  6 


19 


M.d.  The  dynamics  (j>(t)  =  f3(t )  only  involve  the  minimizing  player,  the 
running  cost  L(f3 )  +  (a,  (3)  —  H{a)  involves  both,  and  the  terminal  cost 
is  2F(cj)(l)).  The  adaptive  importance  sampling  scheme  will  be  defined  in 
terms  of  the  optimal  control  for  the  maximizing  player,  which  one  would 
prefer  to  obtain  in  feedback  form.  Representing  the  infimum  in  terms  of  the 
Legendre  transform  H  of  L  gives 

0  =  (WF)t  +  sup  \-H{-a  -  ( WF)X )  -  H(a )]  . 

ae  Rd 

The  strict  convexity  of  H  implies  that  the  supremum  is  uniquely  achieved 
at 

a*{x,t)  =  -{WF)x{x,t)/2,  (3.2) 

and  that  WF  should  satisfy 

0  =  (WF)t  -  2H(-(Wf)x/2).  (3.3) 

Lastly,  we  observe  that  WF  should  also  formally  satisfy  with  terminal  con¬ 
dition  WF(  x,  1)  =  2  F(x). 

Next  we  formally  derive  the  PDE  for  UF.  Owing  to  the  convexity  of  L, 
UF  is  the  value  function  of  the  deterministic  control  problem 

UF(x,t)  =  inf  [  L(4>(s))  ds  +  F(<f>(  1))  ,  (3.4) 

</>  Ut 

where  the  infimum  is  over  all  absolutely  continuous  (j)  which  satisfy  (j)(t )  =  x. 
A  standard  dynamic  programming  argument  implies  that  UF  satisfies  [4,  21] 
the  DPE 

0  =  (UF)t  +  inf  [ L(a )  +  (a,  (UF)X)}  =  (UF)t  -  H{-(UF)X),  (3.5) 

a£Rd 

with  terminal  condition  UF(x,  1)  =  F( x). 

Comparing  the  PDE  (3.3)  with  (3.5),  and  noting  WF(x,  1)  =  2 UF(x,  1), 
we  conclude  that  (3.1)  holds  if  either  PDE  has  a  unique  solution.  Further¬ 
more,  note  that  a*  defined  by  (3.2)  also  satisfies 

a*(x,t)  =  -(WF)x(x,t)/2  =  -( UF)x(x,t ).  (3.6) 

3.2  Implementation  issues 

One  approach  to  the  construction  of  optimal  or  nearly  optimal  adaptive 
importance  sampling  schemes  (i.e.,  selection  of  the  control  an ),  would  be  to 
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solve  (numerically  if  need  be)  the  DPE  associated  with  Wp.  However,  any 
such  scheme  would  directly  depend  on  n,  and  in  general,  one  would  prefer 
schemes  without  this  dependence. 

An  alternative  approach,  which  is  to  be  discussed  in  this  subsection,  is  to 
consider  the  DPE  associated  with  the  limit  problem  Wp  (equivalently  Up). 
In  the  preceding  subsection,  we  formally  characterized  Wp  as  the  solution 
to  an  Isaac’s  equation,  and  hence  as  the  value  function  for  a  differential 
game.  We  also  showed  that  Wf  can  be  characterized  as  the  value  function 
of  a  deterministic  optimal  control  problem,  which  is  often  easier  to  solve  or 
approximate  numerically. 

The  equation  (3.6)  identifies,  at  least  formally,  an  optimal  feedback  con¬ 
trol  policy.  However,  this  observation  is  not  totally  satisfactory  for  several 
reasons.  The  first  is  that  even  if  we  have  an  exact  formula  for  Uf  (or  Wf), 
the  DPE  (3.5)  is  usually  satisfied  only  in  a  weak  sense,  in  which  case  the  par¬ 
tial  derivative  ( Uf)x  may  not  be  defined  for  all  times  and  spatial  points.  Un¬ 
der  additional  conditions  one  can  show  that  the  set  of  points  where  a*(x,  t) 
is  well  defined  is  open  and  dense  [19].  At  points  where  the  gradient  is  not 
defined  there  are  often  superdifferentials,  and  a  natural  replacement  for  the 
DPE  suggests  feedback  controls  defined  in  terms  of  extreme  superdiffer¬ 
entials  rather  than  gradients.  Nonetheless,  a  comprehensive  theory  is  not 
available.  The  second  reason  is  that  Uf  (or,  Wf)  does  not  usually  have 
an  explicit  solution,  and  so  in  many  cases  a  numerical  approximation  is  re¬ 
quired.  Convergent  numerical  approximations  have  been  studied  extensively 
(see,  e.g.,  [27]).  Practical  experience  and  some  theory  (e.g.,  [16]),  have  shown 
that  the  numerical  schemes  which  construct  (provably)  convergent  approxi¬ 
mations  to  Uf  (or,  Wp)  also  yield  nearly  optimal  feedback  controls,  though 
they  are  usually  limited  by  practical  implementation  constraints  to  low  di¬ 
mensional  problems. 

Our  goal  in  this  subsection  is  to  formally  characterize  the  optimal  control 
a*  at  all  points  (x,  t)  through  the  dual  relation  in  the  Legendre  transform 
(1.2).  It  is  straightforward  to  see  from  the  control  problem  (3.4)  that  an 
optimal  control  at  (x,  t)  is  the  minimize!’  in  equation  (2.10),  say  /3*(x,t), 
thanks  to  the  convexity  of  L.  Note  that  the  existence  of  (3*(x,  t)  is  guaranteed 
with  very  mild  conditions,  e.g.,  F  is  bounded  and  continuous.  This  implies 
that  — (U f)x(%,  t)  and  (3*(x,t )  are  conjugate.  It  follows  from  (3.6)  that 

a*(x,t)  is  conjugate  to  the  minimizer  (3*{x,t)  in  (2.10). 

At  points  where  ( Uf)x(x ,  t)  exists  this  definition  gives  a*(x,  t)  =  —(Uf)x(x,  t) 
At  points  where  {Uf)x{xU)  does  not  exist  there  are  multiple  minimizing 
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/3*(x,t),  and  one  should  define  a*(x,t)  though  conjugacy  in  any  Borel  mea¬ 
surable  way. 


Remark  3.1  In  the  setting  of  Cramer’s  theorem,  F  =  oo  •  1  In  this  case, 

(3*(x,  t )  £  argmin{(l  —  t)L{(3 )  :  x  +  (1  —  t)/3  G  A}, 

and  a*(x,  t )  is  its  conjugate.  This  also  implies  that,  at  time  (x,  t ),  the  change 
of  measure  associated  with  a*(x,t)  actually  shifts  the  mean  to  /3*(x,t). 
Roughly  speaking,  this  means  that  the  controlled  random  walk  always  points 
to  the  “most  likely”  end  point,  given  that  the  end  point  is  in  A  and  that  we 
are  currently  at  x  with  1  —  t  units  of  time  to  go.  Note  that  we  can  assume 
without  loss  that  A  is  closed  under  the  conditions  of  Proposition  2.1.  This 
suffices  to  guarantee  the  existence  of  for  all  x  £  Rd,f  £  [0, 1). 

3.3  The  relations  between  blind  and  adaptive  importance 
sampling  and  differential  games  in  the  small  and  the 
large 

In  the  adaptive  sampling  scheme,  the  sampling  distribution  is  allowed  to  be 
a  function  of  the  current  state  of  the  “controlled  empirical  mean”  and  the 
time  index.  In  contrast,  the  scheme  suggested  by  the  standard  heuristic  in 
the  setting  of  Cramer’s  Theorem  is  a  fixed  change  of  measure  throughout.  In 
a  setting  more  general  than  Cramer’s  Theorem  (e.g.,  the  models  of  Chapter 
6  of  [13])  the  scheme  suggested  by  the  standard  heuristic  might  also  depend 
on  the  time  index,  though  not  on  any  controlled  state  (in  other  words,  it 
is  an  “open  loop”  control).  In  both  cases,  a  logarithmic  transform  would 
be  used  to  characterize  the  asymptotic  behavior  of  the  second  moment, 
and  this  logarithmic  transformation  introduces  another  control  through  the 
variational  formula  for  relative  entropy  (cf.  [13]  and  Section  2). 

In  the  case  of  adaptive  sampling,  the  control  that  tries  to  minimize  the 
second  moment  maximizes  in  the  logarithmic  transform,  owing  to  an  inter¬ 
vening  minus  sign.  The  control  that  large  deviations  introduces  attempts  to 
minimize,  and  an  examination  of  the  dynamic  programming  equation  shows 
that  at  each  time  step  the  “large  deviation  control”  gets  to  see  the  variance 
control  used  to  generate  the  next  sample,  thus  giving  it  the  “information 
advantage.” 

Thus  in  the  limit  n  — >  oo  we  expect  to  obtain  a  differential  game  with 
the  advantage  given  to  the  minimizing  player,  the  so-called  “lower  game” 
(see,  e.g.,  [17]).  We  refer  to  such  games  as  games  “in  the  small,”  since  in  the 
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discrete  time  prelimit  game  the  players’  selections  of  controls  are  interleaved 
and  sequential. 

In  the  case  of  the  standard  heuristic  the  limit  game  is  quite  different. 
Here  the  maximizing  player  (the  player  who  selects  the  sampling  distribu¬ 
tion)  must  show  his  entire  “open  loop”  control  to  his  opponent,  who  can 
then  minimize  given  knowledge  of  the  opponent’s  control  for  all  t  G  [0, 1]. 
Such  a  game  will  be  referred  to  as  the  “lower  game  in  the  large.”  It  is  easy 
to  show  that  the  “lower  game  in  the  large”  is  never  greater  than  the  “lower 
game  in  the  small.”  When  these  two  coincide,  one  would  expect  that  the 
standard  heuristic  and  the  adaptive  importance  sampling  scheme  have  the 
same  (large  deviation)  asymptotic  properties,  even  though  there  could  still 
be  a  significant  difference  in  performance  for  any  given  value  of  n.  How¬ 
ever,  when  there  is  a  gap  between  the  values  of  the  two  games  the  standard 
heuristic  should  not  be  asymptotically  optimal,  and  in  fact  the  relative  error 
in  the  standard  heuristic  should  grow  exponentially  with  n. 


3.4  Examples 

In  this  section  we  include  some  numerical  examples  to  illustrate  the  sub¬ 
tle  pitfalls  of  the  blind  (i.e. ,  standard  heuristic)  importance  sampling.  All 
the  simulations  are  performed  with  standard  S-Plus  software,  and  the  time 
difference  between  blind  importance  sampling  and  adaptive  importance  sam¬ 
pling  is  negligible.  Each  simulation  takes  at  most  a  few  seconds. 

Example  1:  Let  Yq.  Y±, ...  be  a  sequence  of  iid  N(l,  1)  random  variables,  and 
Sn  =  Yq  +  •  •  •  +  Yn_ i  the  partial  sum.  Consider  the  set 


A  =  (— oo,  a]  U  [ b ,  oo) 


with  a  <  1  <  b.  We  want  to  estimate  P{Sn/n  G  A}  by  importance  sampling. 
It  is  not  difficult  to  verify  that 


H(a )  =  logE 


,aY0 


1  2 
—  ol  +  —a 


L(P)  =  sup  [a/3  -  H(a )]  =  \{P  -  if 


It  follows  from  Cramer’s  theorem  that 

lim  —  log  P{Sn/n  G  A}  =  —  inf  L(f3). 
n  n  0&A 

From  now  on,  we  will  assume  a  +  b  <  2,  which  implies  that  (3*  =  b  achieves 
the  minimum  of  L  over  A. 
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The  blind  importance  sampling  identifies  the  change  of  measure  as  the 
exponential  tilt  associated  with  the  supremizing  point  a*  in  a  — >  a/3*—H(a), 
which  is 

a*  =  (3*  -  1  =  b  —  1. 

In  other  words,  the  new  probability  measure  is 

u{dy )  =  ea*y-H^*\i{dy)  =  ea*y~H ^  •  -^e-^^dy  = 

V27T  v27t 

which  is  just  the  probability  measure  associated  with  N(b,  1).  The  algorithm 
starts  by  generating  iid  sequences  {Tjfc}  according  to  N(b,  1),  and  then  forms 
the  estimator 


1 

K 


K- 1 

H  !{ Sfi/n£A} 
k= 0 


n— 1 

[e -a*Yf+H{a*) 

i= 0 


1 

K 


K—l 


J2  1{S^/neA}e 
k= 0 


S*+nH(a*) 


The  rate  of  convergence  of  this  estimator  is  determined  by  the  second  mo¬ 
ment  of  the  summands,  say  Vn,  which  satisfies  (see  the  discussion  in  the 
Introduction) 

lim  --  log  Vn  =  inf  [a*/3  -  H(a *)  +  L(0) } . 
nn  /3eV  J 


If  (3*  achieves  the  infimum,  then  the  right  hand  side  equals  2L(/3*).  which  is 
the  optimal  rate  for  the  second  moment.  However,  this  is  not  always  true. 
Indeed,  it  is  not  difficult  to  verify  that 

lim  —  —  log  Vn  =  -(a  +  b-2)2  -  (b  -  l)2  <  2 L(JT) 
n  n  2 

if  a  +  3b  >  4. 

In  the  numerical  simulation  below,  we  take  n  =  60,  K  =  5000,  a  =  0.75, 
b  =  1.2.  The  true  value  of  the  probability  we  want  to  estimate  is 


pn  =  l  -$((&-  l)v^)  +  $((o  -  l)Vn)  =  8.7%. 


The  table  below  presents  the  outcome  of  4  simulations. 


No.  1 

No.  2 

No.  3 

No.  4 

Estimate  pn 

15.42 

7.81 

5.96 

6.02 

Standard  Error 

4.70 

1.75 

0.12 

0.12 

95%  Confidence  Interval 

[6.02,  25.82] 

[4.31,  11.31] 

[5.72,6.20] 

[5.78,6.26] 

Number  of  “Rogue”  Trajectory 

4 

1 

0 

0 

(-log  Vn)/(~\ogpn) 

-1.3 

-0.14 

1.6 

1.6 

Table  1. 
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In  simulations  No.  1  and  No.  2,  the  presence  of  “rogue”  trajectories 
(see  the  Introduction)  greatly  raises  the  standard  errors  associated  with  the 
estimates.  Indeed,  in  each  of  these  two  simulations,  the  proportion  of  the 
estimate  of  the  second  moment  due  to  these  few  “rogue”  trajectories  is  more 
than  99%.  In  simulations  No.  3  and  No.  4,  however,  there  are  no  “rogue” 
trajectories,  and  the  standard  error  associated  with  the  estimate  is  decep¬ 
tively  small.  The  reason  is  that  the  standard  error  is  itself  estimated  from 
the  sample.  Without  “rogue”  trajectories,  we  actually  underestimate  the 
standard  error.  Therefore,  we  cannot  put  much  confidence  in  the  standard 
errors  thus  obtained,  or  in  the  “tight”  confidence  intervals  that  follow.  Note 
that  the  confidence  intervals  from  these  two  simulations  do  not  contain  the 
true  value. 

The  adaptive  importance  sampling,  on  the  other  hand,  yields  more  ac¬ 
curate  estimates  and  its  performance  is  much  more  stable.  Below  is  the 
numerical  result.  The  standard  errors  (almost  the  same  across  different 
simulations)  are  much  smaller  compared  to  those  of  the  blind  algorithm. 


No.  1 

No.  2 

No.  3 

No.  4 

Estimate  pn  (%) 

8.66 

8.59 

8.72 

8.66 

Standard  Error  (%) 

0.06 

0.06 

0.06 

0.07 

95%  Confidence  Interval  (%) 

[8.54,  8.78] 

[8.47,  8.71] 

[8.60,8.84] 

[8.52,8.80] 

(-log  yn)/(-iogp„) 

1.51 

1.45 

1.49 

1.39 

Table  2. 


The  selection  of  a  nearly  optimal  control  an  =  {q"'(-)  :  j  =  0,1, ,  n—  1} 
was  discussed  in  the  preceding  sections.  It  was  formally  argued  in  Subsection 
3.1  that  the  associated  limiting  differential  game  indicates 

aj(x)  =  —Ux(x,j/n), 


is  a  good  choice,  where 

U (x,  t)  =  inf{(l  —  t)L(P)  :  x  +  (1  —  t)/3  £  A}. 


In  this  example,  it  is  not  difficult  to  verify  that 


U  ( x ,  t ) 


if  x  >  b  —  (1  —  t)  or  x<a  —  (1  —  t) 
if  (a  +  b)/2  —  (1  —  t)  <  x  <  b  —  (1  —  t) 
if  a  —  (1  —  t)  <  x  <  (a  +  b)/2  —  (1  —  t) 
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and 

(  0  ,  if  x  >  b  —  (1  —  t)  or  x  <  a  —  (1  —  t) 

—Ux(x,t)  =  l  —  if  (a  +  b)/2  —  (1  —  t)  <  x  <  b  —  (1  —  t)  . 

\  jEf  ~  1,  if  a  —  (1  —  t)  <  x  <  (a  +  b)/2  —  (1  —  t) 

At  points  where  —Ux  is  not  defined  any  extreme  superdifferential  may  be 
used  in  lieu  of  Ux.  In  all  cases  this  is  the  same  as  using  the  maximizing  a 
in  a  — »  a/3*(x,t)  —  H(a),  where 

(3*(x,  t)  €  argmin{(l  —  t)L((3)  :  x  +  (1  —  t)(3  €  A}. 

See  Subsection  3.2  for  more  discussion. 

To  summarize,  the  distribution  of  Yj'"'  given  { Y, ]k,  i  =  0, ...  ,j  —  1}  is 
normal  with  variance  1  and  mean  —Ux(x,j/n)  +  1  =  j3*(x,j/n),  with  x  = 

(y0*  + •••  +  *?_!)/«. 

The  following  table  illustrates  the  asymptotic  optimality  of  the  adaptive 
importance  sampling  as  n  — >  oo. 


n  =  100 

n  =  200 

n  =  500 

Theoretical  pn 

2.90  x  10'^ 

2.54  x  lO”3 

3.88  x  10~° 

Estimate  pn 

2.98  x  10-^ 

2.57  x  10_i> 

3.86  x  10'° 

Standard  Error 

0.06  x  10'^ 

0.04  x  10_il 

0.07  x  10_o 

95%  Confidence  Interval 

[2.86,3.10]  x  10“^ 

[2.49,2.65]  x  lO"3 

[3.72,4.00]  x  10"° 

(-log  yn)/(-iogp„) 

1.65 

1.84 

1.93 

Table  3. 


Example  2:  Let  {Yi  =  (Y^0,!^1)}  be  an  iid  sequence  of  two-dimensional, 
normally  distributed  random  vectors  with  mean  0  and  covariance  matrix  I, 
and  Sn  =  Yq  +  •  •  •  +  Y^_i  the  partial  sum.  For  a,  (3  G  M2  we  have 


H(a)  =  log  E 


>To> 


=  2  a 


L((3)  =  sup  [{a,  (3)  -  H(a)\  =  -\ 


Consider  the  set 

A={y  =  (y°,  y1)  :  ||  y  +  (a,  0)||2  =  (y°  +  a)2  +  ( y lf  >  r2}  , 
where  0  <  a  <  r  are  constants.  Then  Cramer’s  Theorem  yields 

lim-log P{Sn/n  G  A)  =  -  inf  L(fi)  =  -L((3*)  =  -]-(r  -  a)2 , 
n  n  p&A  2 
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with  minimizing  (3*  =  (r  —  a,  0).  It  is  not  difficult  to  identify  the  supremizing 
a  in  a  —>  (a,  (3*)  —  H (a)  as 

a*  =  (3*  =  (r  —  a,  0), 


and  the  blind  importance  sampling  will  sample  under  the  new  probability 
measure 

v{dy)  =  eK>2 ^(dy)  =  -^e-'^^f^dy, 
or  the  distribution  of  N(/3*,I).  As  before,  the  estimator  is 


K—l 


n—  1 


o-(a*,S*)+nH(a*) 


K  ^  hsti/neA}  II  e 

fc=0  i= 0 

where  S%  =  Yq  +  ■  ■  ■  +  Y^_x  and  {Yf}  are  iid  N(f3*,T)  random  vectors. 

It  is  not  difficult  to  verify  that  the  second  moment  Vn  associated  with 
the  summands  satisfies 


lim - \ogVn  =  inf  [(a*,/3)  —  H(a*)  +  L((3)  1  =  inf 

nn  PeA  '  K  n  0£  A 


h\(3  +  a*\\2-\\a*\\2 


If  we  further  assume  that  0  <  a  <  r / 2,  then  the  infimum  is  not  achieved  at 
(3*,  but  at  (— r  —  a,  0).  In  fact,  (3*  achieves  the  maximum  of  the  right  hand 
side  over  (3  £  dA.  In  this  case,  we  have 

lim  —  —  log  Vn  =  2a2  —  (r  —  a)2  <  2 L(/3*)  =  (r  —  a)2. 

n  n 

In  the  table  below,  we  take  n  =  60,  K  =  5000,  r  =  0.5,  a  =  0.05.  The 
theoretical  value  of  the  probability  can  be  obtained  as  follows: 


pn  =  P{Sn/n  £  A}  =  P{\\Sn/^/n+  (V«a,0)||2  >  nr2}. 


Since  Sn/y/n  is  IV(0, 1)  distributed,  \\Sn/^fn  +  (^/na,  0) || 2  is  a  non-central 
chi-square  random  variable  with  2  degree  of  freedom  and  noncentrality  para¬ 
meter  ||  (y^no,  0)  || 2  =  na2.  Its  distribution  is  available  in  standard  statistics 
softwares  like  S-Plus.  Here  we  have 


pn  =  8.97  x  10~4. 


No.  1 

No.  2 

No.  3 

No.  4 

Estimate  pn  (xl0— 4) 

6.38 

6.04 

7.26 

11.09 

Standard  Error  (xlO-4) 

0.78 

0.42 

1.17 

5.40 

95%  Confidence  Interval  (xlO- 4) 

[4.82,  7.94] 

[5.20,  6.88] 

[4.92,  9.60] 

[0.29,21.89] 

(-log  yn)/(-iogp„) 

1.41 

1.56 

1.33 

0.96 

Table  4. 
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In  this  example,  we  do  not  have  a  clear-cut  definition  of  a  “rogue”  tra¬ 
jectory.  The  main  idea,  however,  remains  the  same.  There  are  a  continuum 
of  possibilities  for  exiting  away  from  j3*  and  building  up  a  large  Radon- 
Nikodym  derivative,  and  to  varying  degrees  these  trajectories  degrade  the 
estimate.  As  the  table  above  shows,  both  the  estimated  probabilities  and 
estimated  second  moments  vary  considerably  across  different  simulations. 

The  adaptive  importance  sampling  again  outperforms  the  blind  algo¬ 
rithm.  Below  are  the  simulation  results,  and  the  estimates  are  clearly  more 
accurate  and  stable. 


No.  1 

No.  2 

No.  3 

No.  4 

Estimate  pn  (xlO- 4) 

9.29 

8.85 

8.90 

9.18 

Standard  Error  (xlO-4) 

0.21 

0.24 

0.18 

0.28 

95%  Confidence  Interval  (xlO- 4) 

[8.87,  9.71] 

[8.37,  9.33] 

[8.54,  9.26] 

[8.62,  9.74] 

(-log  Tn)/(-log  pn) 

1.82 

1.78 

1.84 

1.75 

Table  5. 


The  choice  of  control  an  is  obtained  as  before.  One  can  compute  the 
function  U  explicitly  and  then  let  a”(x)  =  — Ux(x,j/n ),  as  in  Example  1. 
Alternatively,  as  discussed  in  Subsection  3.2,  we  can  use  that  —Ux  is  the 
maximizing  a  in  (a,  f3*(x,t)}  —  H(a),  where 

(3*(x,t)  =  argmin{(l  —  t)L((3 )  :  x  +  (1  —  t)(3  G  A}  . 


It  is  not  difficult  to  check  that 


-Ux(x,t)  =  /3*(x,t) 


0  ,  if  x  G  A 

(||»+(a,0)||  -1)  •  [^  +  M)],  if  xgA. 


In  other  words,  the  distribution  of  Y*  given  {Yf,  i  =  0,  •  •  ■ ,  j  —  1}  is  normal 
with  variance  I  and  mean  f3*(x,j/n)  when  x  =  (Yq  +  •  •  •  +  Y^_-^)jn. 

Below  are  more  simulation  results,  which  illustrate  the  asymptotic  opti¬ 
mality  of  the  adaptive  importance  sampling  as  n  tends  to  infinity. 


n  =  40 

n  =  80 

n  =  120 

Theoretical  pn 

8.49  x  10'a 

1.00  x  10~4 

1.40  x  10"° 

Estimate  pn 

8.64  x  10'a 

0.98  x  10~4 

1.39  x  10~° 

Standard  Error 

0.25  x  10~3 

0.02  x  10~4 

0.03  x  10-° 

95%  Confidence  Interval 

[8.14,9.14]  x  10~a 

[0.94,1.02]  x  10"4 

[1.33, 1.45]  x  lO-0 

(-log  Tn)/(-log  pn) 

1.65 

1.85 

1.92 

Table  6. 
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4  Appendix 

Lemma  4.1  Let  y  be  a  probability  measure  on  and  let  f  :  >  M  be 

bounded  and  measurable.  Define  H(a)  by  (1.1),  and  assume  H(a )  <  oo  for 
all  a  G  Rd.  Then  for  each  a  €  Md, 

-log  J 

=  r  _.mf  f-R(7ll/*)+  [  f(vh(dy)+  f  (a,y)'y(dy)  -  H(a). 

{7:R(7||/i)<oo}  L  J  J  J 

Proof.  Fix  a  G  Rd,  and  define  a  change  of  probability  measure  by 

—  (y)  =  e-(a,y)-H(~a) ' 
dy 

It  follows  that 

-  log  J  e-<a.i/>+»(«)-/(y)Ai(d2/) 

=  -  14(a)  -  H{—a)  -  log  J  e~f^0{dy) 

=  -H(a)-H(-a)+  inf  [-Rfrll#)  +  f  fivh(dy). 

{7:i?(7||0)<oo}  L  1  J 

Into  this  expression  we  insert 

R(-f\\0)  =  J  log^d'y 

=  /log$‘i7+/log^<i7 
=  Rfi)\\y)  +  H(—a)  +  J  (a,  y)'y(dy). 

It  remains  to  show  that 

{7  =  R(l\\0)  <  00}  =  {7  :  H(7||/x)  <  00}. 

Indeed,  if  R( 7 ||/Lt)  <  00  and  Condition  2.1  hold  then  by  [13,  Lemma  1.4.3] 
/  \\y\\^y(dy)  <  00.  It  then  follows  that  f?(7||$)  <  00.  The  proof  for  the 
reverse  is  exactly  the  same.  ■ 

Proof  of  the  Borel  Measurability  of  V'fi  and  W'f.  We  only  need  to 
show  the  Borel-measurability  of  Vfi(x,  i),  and  to  do  this  we  will  use  induction 
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on  i.  Clearly  Vft(x,n)  =  exp  {—2 nF(x)}  is  Borel  measurable.  Suppose  that 
Vp(x,i  + 1)  is  Borel  measurable.  Since  the  infimum  of  countably  many  Borel 
measurable  functions  is  still  Borel  measurable,  it  suffices  to  show  that 

V£(x,i)  =  inf  /  e-<«,s/>+^(“)y«  (x  +  -y,i  +  l)  n(dy), 

a£Qd  J  V  n  ) 

where  Q  is  the  set  of  rationals.  Indeed,  for  any  a  G  Rd,  there  exist  a  sequence 
{am}  G  Qd  such  that  ||am||  <  2||a||  and  am  — >  a.  Using  (2.6)  we  have  the 
bound 

e-(am,y)+H(am)yn  ^  }_y,  ^  <  e2||a||-|D||+sup{if(a):||a||<2||a||}  .  e2n||F||00^ 

and  the  right  hand  side  is  integrable.  The  Dominated  Convergence  Theorem 
implies 

/ 

converges  to 

J  e-(a,y)+H(a)yn  ^  +  +  ^ 

and  thus  the  infimum  may  be  restricted  to  Qd.  ■ 


e-(am,y)+H(am)yn  ^  +  ^-y,i  +  1^  y(dy) 


Proof  of  Lemma  2.2.  Define 


v  =  sup  inf 

!7GC 


f(y)i(dy)  +  Rin/Wv-)  +  /  («,  y)i(dy)  -H(a) 


and  similarly  v  when  the  order  of  inf  and  sup  are  exchanged.  The  min/max 
inequality  yields  v  >  y.  It  suffices  to  show  the  reverse  inequality. 

For  an  arbitrary  constant  M  <  oo,  we  have 


v  >  sup  inf 

||a||<M7GC 


f(y)ry{dy)  +  R('y\\y)  +  J  (a,y)'y(dy)  -  H(a) 


inf  sup 

7GC  ||a||<M 


=  V 


M 


f(yh(dy)  +  R(l\\tJ>)  +  J  (a,y)'y(dy)  -  H(a) 


The  interchange  of  the  infimum  and  supremum  is  valid  thanks  to  the  min/max 
theorem  [38],  the  convexity  of  C  and  { a  :  ||a||  <  M},  and  the  compactness 
of  {a  :  ||ci!||  <  M}. 
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We  will  use  the  definition 


Lm{P)  =  sup  [(a,  (3)  -  H (a)]  . 

||a||<M 

Observe  that  Lm  (f  yn(dy))  =  0,  since  0  <  Lm  <  L  and  L  (f  yy(dy))  =  0. 
We  also  define  the  level  set 

V  =  1 7  :  7  is  a  probability  measure  on  Md,  and  R( 7||/i)  <211/11001  C  C, 
which  is  independent  of  M.  We  can  write 

HM  =  inf  j  f{y)l(dy)  +  R(  j\\n)  +  LM  VI  (dy)^  . 

It  follows  that  vM  <  H/lloo  by  taking  7  =  //.  Since  Lm  is  non- negative  and 
/  is  bounded, 

7gc\i?  /  f(y')'y(dy')  +  ^11^)  +  Lm  (/  VLidy)^  > -ll/lloo+211/lloo  >  TM, 

which  implies  that  the  infimum  over  C  is  the  same  as  the  infimum  over  V. 
Thus 

~M  =  \ev  J  +  ^(tIIm)  +  lm  i^J  yiidy^j  . 

We  will  further  argue  that  there  exists  a  G  V  that  achieves  the 
infimum.  To  this  end,  we  will  associate  the  space  of  probability  measures 
with  the  r-topology,  which  is  the  smallest  topology  under  which  the  mapping 

7-/ h(yh(dy) 

is  continuous  for  every  bounded  and  measurable  function  h.  The  level  set 
T>  is  not  only  compact  in  the  weak  topology,  it  is  also  compact  under  the 
r-topology  [13,  Section  9.3].  Suppose  now  {7]^  :  m  >  1}  is  a  minimizing 
sequence.  The  compactness  of  V  implies  the  existence  of  a  subsequence,  still 
denoted  by  {7]$},  such  that 

7m  7 mi  for  some  7 m  G  £*• 

However,  since  the  mapping 

7  ^  R(ih) 


31 


is  lower  semicontinuous  (since  the  r-topology  is  finer  than  the  weak-convergence 
topology),  we  have 


Rinrhh)  <  liminfi?(7^||^). 


Furthermore,  we  have 

J  f(y)lM(dy)  -»•  J  fiyhhidy), 

from  the  definition  of  the  r-topology  and  the  boundedness  of  /,  while 
Lm  (y  mM(.dy)j  -*•  lm  (/  yi*M{dy)^  , 

thanks  to  the  uniform  integrability  of  {7^}  [13,  Proposition  5.3.2]  and  the 
continuity  of  Lm-  It  follows  readily  that  7^  is  a  minimizer. 

Again,  thanks  to  the  compactness  of  L\  there  exists  a  subsequence  of 
{7 still  denoted  by  {7m},  such  that  — ►  7*  G  X>  under  the  r-topology, 

for  some  7*  G  D.  Fix  an  arbitrary  positive  constant  e.  Thanks  to  the 
lower-semicontinuity  of  i?(-||/x),  the  definition  of  the  r-topology,  and  the 
boundedness  of  /,  there  exists  M\  >  0  such  that  for  all  M  >  Mi, 

^(7mIIm)  -  #(7*ll^)  > 

J  f(y)lh{dy)  -  J  f{y)^*{dy)  >  -e. 

Furthermore,  by  the  definition  of  L,  there  exists  an  a*  G  such  that 

L  y7*(dy))  <  (a*,/ 2/7*  (dy))  -  +  e. 

Since  /  y^*M{dy)  — ►  /  y^*{dy)  thanks  to  the  uniform  integrability  of  {7m}, 
there  also  exists  M2  such  that 

(a*,  J  yih{dy >  ^a*,y  yj*(dy)j  -  e, 

for  all  M  >  M2.  Using  the  definition  of  Lm  as  a  Legendre  transform,  for 
M  >  max{||a:*||,  Mi,  M2}  we  have 

y  >  yM 

=  y  f{y)ih{dy)  +  R(i*M\\y)  +  lm  ( j  yihidy)^ 
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> 

J  f{y)l*{dy)  +  R{1*\\pi)  +  LM  ( 

J  yj*M(dy) 

)-2e 

> 

J  f{y)i*{dy)  +  R{  7*||/x)  +  (a*, 

J 

)  -  H(a*)  -  2s 

> 

J  f(yh*(dy)  +  R(j*\\h)  +  (a*, 

J  yj*(dy 

-  H(a*)  -  3e 

> 

J  f(yh*(dy)  +  R(1*\\p)  +  L^J 

yi*{dy)^j  - 

■4s 

> 

inf 

7  eC 

J  f(yh(dy)  +  R(t\\ij)  +  L  | 

(y  yj{dy)j 

-4s. 

Since  s  is 

arbitrary,  we  obtain 

v  > 

inf 

7SC 

j  f(y)j(dy)  +  -R(tIIm)  +  l 

(y  yj(dy)j 

= 

inf  sup 

7GC  QeRd 

J  f(,y)ry(dy)  +  R(j\\n)  +  J  (a,y)j(dy)  -  H(a) 

=  V. 

This  completes  the  proof.  ■ 
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